Regression with scikit-learn

Using the Soccer Dataset

We will again use the open dataset from Kaggle that served as our example in Week 1.

Recall that this European Soccer Database contains more than 25,000 matches and more than 10,000 players from the European professional soccer seasons of 2008 to 2016.

Note: Please download the file database.sqlite if you don't yet have it in your Week-7-MachineLearning folder.

Import Libraries

import sqlite3
import pandas as pd 
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from math import sqrt

Read Data from the Database into pandas

# Create your connection.
cnx = sqlite3.connect('database.sqlite')
df = pd.read_sql_query("SELECT * FROM Player_Attributes", cnx)

df.head()

df.shape

df.columns
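To try the loading step without the Kaggle file, the sketch below builds a tiny in-memory SQLite database and reads it with the same `pd.read_sql_query` call. The three-column table and its values are made up for illustration; the real `Player_Attributes` table has many more columns.

```python
import sqlite3
import pandas as pd

# Hypothetical stand-in for database.sqlite: an in-memory database
# holding a heavily reduced Player_Attributes table.
toy_cnx = sqlite3.connect(":memory:")
toy_cnx.execute(
    "CREATE TABLE Player_Attributes (id INTEGER, overall_rating REAL, potential REAL)"
)
toy_cnx.executemany(
    "INSERT INTO Player_Attributes VALUES (?, ?, ?)",
    [(1, 67.0, 71.0), (2, 74.0, 76.0), (3, None, 80.0)],  # made-up rows, one NULL
)
toy_cnx.commit()

# Same call as in the notebook: the SQL result becomes a DataFrame.
toy_df = pd.read_sql_query("SELECT * FROM Player_Attributes", toy_cnx)
toy_cnx.close()

print(toy_df.shape)
print(toy_df.isna().sum())  # SQL NULLs show up as missing values in pandas
```

Closing the connection once the data is in pandas is optional for a short notebook, but it is a tidy habit.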

Declare the Columns You Want to Use as Features

features = [
       'potential', 'crossing', 'finishing', 'heading_accuracy',
       'short_passing', 'volleys', 'dribbling', 'curve', 'free_kick_accuracy',
       'long_passing', 'ball_control', 'acceleration', 'sprint_speed',
       'agility', 'reactions', 'balance', 'shot_power', 'jumping', 'stamina',
       'strength', 'long_shots', 'aggression', 'interceptions', 'positioning',
       'vision', 'penalties', 'marking', 'standing_tackle', 'sliding_tackle',
       'gk_diving', 'gk_handling', 'gk_kicking', 'gk_positioning',
       'gk_reflexes']

Specify the Prediction Target

target = ['overall_rating']

Clean the Data

df = df.dropna()
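`dropna()` removes every row that has a missing value in any column, so it is worth checking how many rows survive. A minimal sketch on a made-up frame (the values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Made-up frame with two incomplete rows.
toy = pd.DataFrame({
    "potential": [71.0, 76.0, np.nan, 80.0],
    "crossing": [49.0, np.nan, 60.0, 55.0],
    "overall_rating": [67.0, 74.0, 70.0, 78.0],
})

print(toy.isna().sum())       # missing values per column
cleaned = toy.dropna()        # keep only rows with no missing value
print(len(toy), "->", len(cleaned))
```

On the real data, comparing `df.shape` before and after the call shows how much was discarded.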

Extract Features and Target ('overall_rating') Values into Separate DataFrames

X = df[features]

y = df[target]

Let us look at a typical row from our features:

X.iloc[2]

Let us also display our target values:

y
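Because `target` is a list, `df[target]` returns a one-column DataFrame rather than a Series. The sketch below shows the difference on a made-up frame; the shape of `y` matters later because some estimators return predictions in the same shape as the target they were fitted on.

```python
import pandas as pd

# Made-up frame for illustration.
toy = pd.DataFrame({"potential": [71, 76, 80], "overall_rating": [67, 74, 78]})
target = ["overall_rating"]

y_frame = toy[target]              # list of names -> DataFrame, 2-D
y_series = toy["overall_rating"]   # single name   -> Series, 1-D

print(type(y_frame).__name__, y_frame.shape)
print(type(y_series).__name__, y_series.shape)
```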

Split the Dataset into Training and Test Datasets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=324)
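The sketch below illustrates what the two arguments do, using a made-up 100-row frame: `test_size=0.33` holds out roughly a third of the rows, and a fixed `random_state` makes the shuffle repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, two features, one target.
toy_X = pd.DataFrame({"potential": np.arange(100), "crossing": np.arange(100, 200)})
toy_y = pd.DataFrame({"overall_rating": np.arange(100) % 50 + 40})

split_a = train_test_split(toy_X, toy_y, test_size=0.33, random_state=324)
split_b = train_test_split(toy_X, toy_y, test_size=0.33, random_state=324)

X_tr, X_te, y_tr, y_te = split_a
print(len(X_tr), len(X_te))                  # roughly a 2/3 : 1/3 split

# Same random_state -> the same rows land in the test set.
same_rows = X_te.index.equals(split_b[1].index)
print(same_rows)

# Features and target stay aligned row by row.
aligned = X_te.index.equals(y_te.index)
print(aligned)
```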

(1) Linear Regression: Fit a model to the training set

regressor = LinearRegression()
regressor.fit(X_train, y_train)

Perform Prediction using Linear Regression Model

y_prediction = regressor.predict(X_test)
y_prediction

What is the mean of the expected target values in the test set?

y_test.describe()

Evaluate Linear Regression Accuracy Using Root Mean Squared Error (RMSE)

RMSE = sqrt(mean_squared_error(y_true=y_test, y_pred=y_prediction))

print(RMSE)
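RMSE is the square root of the average squared difference between predicted and true values, so it is expressed in the same units as the target (rating points here). A small worked sketch with made-up numbers, comparing a by-hand calculation with the `mean_squared_error` route used above:

```python
from math import sqrt
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up ratings for illustration.
y_true = np.array([70.0, 65.0, 80.0, 75.0])
y_pred = np.array([72.0, 64.0, 77.0, 75.0])

# By hand: errors are 2, -1, -3, 0 -> squares 4, 1, 9, 0 -> mean 3.5
errors = y_pred - y_true
rmse_manual = sqrt(np.mean(errors ** 2))

# Same quantity via scikit-learn.
rmse_sklearn = sqrt(mean_squared_error(y_true=y_true, y_pred=y_pred))

print(rmse_manual, rmse_sklearn)
```

A lower RMSE means predictions sit closer to the true values; comparing it with the mean and spread reported by `y_test.describe()` gives a sense of scale.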

(2) Decision Tree Regressor: Fit a new regression model to the training set

regressor = DecisionTreeRegressor(max_depth=20)
regressor.fit(X_train, y_train)

Perform Prediction using Decision Tree Regressor

y_prediction = regressor.predict(X_test)
y_prediction

For comparison: What is the mean of the expected target values in the test set?

y_test.describe()

Evaluate Decision Tree Regression Accuracy Using Root Mean Squared Error (RMSE)

RMSE = sqrt(mean_squared_error(y_true=y_test, y_pred=y_prediction))

print(RMSE)
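A deep tree such as `max_depth=20` can fit the training rows very closely, so one possible follow-up check is to compare training and test RMSE. The sketch below does this on synthetic data (not the soccer dataset); the data-generating formula, the helper `rmse`, and the shallow depth of 3 are assumptions chosen only for illustration.

```python
from math import sqrt
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a noisy linear signal in three made-up features.
rng = np.random.default_rng(0)
X_syn = rng.uniform(0, 100, size=(500, 3))
y_syn = 0.5 * X_syn[:, 0] + 0.3 * X_syn[:, 1] + rng.normal(0, 5, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.33, random_state=324
)

def rmse(model, X, y):
    """Root mean squared error of model predictions on (X, y)."""
    return sqrt(mean_squared_error(y_true=y, y_pred=model.predict(X)))

results = {}
for depth in (3, 20):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    results[depth] = (rmse(tree, X_tr, y_tr), rmse(tree, X_te, y_te))
    print(depth, results[depth])  # (train RMSE, test RMSE)
```

A large gap between the training and test RMSE of the deep tree suggests it has memorized noise in the training rows; running the same comparison on `X_train` and `X_test` from the notebook would show whether that applies to the soccer data.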